DCU-ADAPT: Learning Edit Operations for Microblog Normalisation with the Generalised Perceptron

Authors

  • Joachim Wagner
  • Jennifer Foster
Abstract

We describe the work carried out by the DCU-ADAPT team on the Lexical Normalisation shared task at W-NUT 2015. We train a generalised perceptron to annotate noisy text with edit operations that normalise the text when executed. The features are character n-grams, hidden layer activations of a recurrent neural network language model, character class, and eligibility for editing according to the task rules. We combine the predictions of 25 models trained on subsets of the training data by selecting the most likely normalisation according to a character language model. We compare the generalised perceptron to conditional random fields, which are restricted to smaller amounts of training data due to memory constraints. Furthermore, we make a first attempt to verify the hypothesis of Chrupała (2014) that the noisy channel model would not be useful due to the limited amount of training data for the source language model, i.e. the language model on normalised text.
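The two mechanisms named in the abstract, executing per-character edit operations and choosing among candidate normalisations with a character language model, can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the paper: the operation labels (`KEEP`, `DEL`, `SUB:x`, `INS:x`), the function names, the add-one-smoothed bigram model standing in for the character language model, and the length-normalised score. The perceptron that would predict the operations is not shown.

```python
import math
from collections import Counter


def apply_edits(text, ops):
    """Execute one (hypothetical) edit operation per input character."""
    if len(text) != len(ops):
        raise ValueError("need exactly one operation per character")
    out = []
    for ch, op in zip(text, ops):
        if op == "KEEP":
            out.append(ch)
        elif op == "DEL":
            continue  # drop the character
        elif op.startswith("SUB:"):
            out.append(op[4:])  # replace the character
        elif op.startswith("INS:"):
            out.append(op[4:] + ch)  # insert a string before the character
        else:
            raise ValueError(f"unknown operation: {op}")
    return "".join(out)


def train_char_bigram(corpus):
    """Count character bigrams over normalised text (toy stand-in for a character LM)."""
    bigrams, unigrams, vocab = Counter(), Counter(), set()
    for line in corpus:
        padded = "^" + line + "$"  # assumed start and end markers
        vocab.update(padded)
        for a, b in zip(padded, padded[1:]):
            bigrams[a, b] += 1
            unigrams[a] += 1
    return bigrams, unigrams, len(vocab)


def avg_log_prob(text, model):
    """Average add-one-smoothed log-probability per character transition."""
    bigrams, unigrams, vocab_size = model
    padded = "^" + text + "$"
    pairs = list(zip(padded, padded[1:]))
    total = sum(
        math.log((bigrams[a, b] + 1) / (unigrams[a] + vocab_size))
        for a, b in pairs
    )
    return total / len(pairs)


def select_normalisation(candidates, model):
    """Pick the candidate that the character model scores highest."""
    return max(candidates, key=lambda c: avg_log_prob(c, model))


# One candidate per (hypothetical) model; here the operations are hand-written.
candidates = [
    apply_edits("tmrw", ["KEEP", "KEEP", "KEEP", "KEEP"]),
    apply_edits("tmrw", ["KEEP", "KEEP", "INS:o", "KEEP"]),
    apply_edits("tmrw", ["KEEP", "INS:o", "INS:or", "INS:o"]),
]
model = train_char_bigram(
    ["see you tomorrow", "tomorrow is fine", "until tomorrow then"]
)
print(candidates)
print(select_normalisation(candidates, model))
```

The score is averaged per transition because a raw sum of log-probabilities would favour shorter candidates in this toy setting; whether the original system normalises for length is not stated in the abstract.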

Similar resources

Comparative Evaluation of Query Expansion Methods for Enhanced Search on Microblog Data: DCU ADAPT @ SMERP 2017 Workshop Data Challenge

The rapid growth in the availability of social media content posted during emergency situations is creating significant interest in research into how this information can be exploited to assist emergency relief operations and to help with emergency preparedness and in early warning systems. We describe the DCU ADAPT Centre participation in the microblog search data challenge at the SMERP 2017 w...

Automatically Constructing a Normalisation Dictionary for Microblogs

Microblog normalisation methods often utilise complex models and struggle to differentiate between correctly-spelled unknown words and lexical variants of known words. In this paper, we propose a method for constructing a dictionary of lexical variants of known words that facilitates lexical normalisation via simple string substitution (e.g. tomorrow for tmrw). We use context information to gen...

Combination of Feature Selection and Learning Methods for IoT Data Fusion

In this paper, we propose five data fusion schemes for the Internet of Things (IoT) scenario, which are Relief and Perceptron (Re-P), Relief and Genetic Algorithm Particle Swarm Optimization (Re-GAPSO), Genetic Algorithm and Artificial Neural Network (GA-ANN), Rough and Perceptron (Ro-P) and Rough and GAPSO (Ro-GAPSO). All the schemes consist of four stages, including preprocessing the data set ba...

Melbourne Language Group Microblog Track Report

This report outlines the TREC 2011 microblog track submission of the Language Technology Group at The University of Melbourne. The microblog track is an ad–hoc retrieval task over Twitter data with temporally-specified queries, and the requirement that all results must predate the query. Our objective is to establish baseline results for the task and study the relative impact of various factors...

Improved-Edit-Distance Kernel for Chinese Relation Extraction

In this paper, a novel kernel-based method is presented for the problem of relation extraction between named entities from Chinese texts. The kernel is defined over the original Chinese string representations around particular entities. As a kernel function, the Improved-Edit-Distance (IED) is used to calculate the similarity between two Chinese strings. By employing the Voted Perceptron and Su...

Publication date: 2015